Hardware support for Local Memory Transactions on GPU Architectures

نویسندگان

  • Alejandro Villegas
  • Ángeles Navarro
  • Rafael Asenjo
  • Oscar Plata
  • Rafael Ubal
  • David Kaeli
چکیده

Graphics Processing Units (GPUs) are popular hardware accelerators for data-parallel applications, enabling the execution of thousands of threads in a Single Instruction Multiple Thread (SIMT) fashion. However, the SIMT execution model is not efficient when code includes critical sections to protect the access to data shared by the running threads. In addition, GPUs offer two shared spaces to the threads, local memory and global memory. Typical solutions to thread synchronization include the use of atomics to implement locks, the serialization of the execution of the critical section, or delegating the execution of the critical section to the host CPU, leading to suboptimal performance. In the multi-core CPU world, transactional memory (TM) was proposed as an alternative to locks to coordinate concurrent threads. Some solutions for GPUs started to appear in the literature. In contrast to these earlier proposals, our approach is to design hardware support for TM in two levels. The first level is a fast and lightweight solution for coordinating threads that share the local memory, while the second level coordinates threads through the global memory. In this paper we present GPU-LocalTM as a hardware TM (HTM) support for the first level. GPU-LocalTM offers simple conflict detection and version management mechanisms that minimize the hardware resources required for its implementation. For the workloads studied, GPU-LocalTM provides between 1.25-80X speedup over serialized critical sections, while the overhead introduced by transaction management is lower than 20%.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improvements in Hardware Transactional Memory for GPU Architectures

In the multi-core CPU world, transactional memory (TM) has emerged as an alternative to lock-based programming for thread synchronization. Recent research proposes the use of TM in GPU architectures, where a high number of computing threads, organized in SIMT fashion, requires an effective synchronization method. In contrast to CPUs, GPUs offer two memory spaces: global memory and local memory....

متن کامل

Cache and Bandwidth Aware Matrix Multiplication on the GPU

Recent advances in the speed and programmability of consumer level graphics hardware has sparked a flurry of research that goes beyond the realm of image synthesis and computer graphics. We examine the use of the GPU (graphics processing unit) as a tool for scientific computing, by analyzing techniques for performing large matrix multiplies in GPU hardware. An earlier method for multiplying mat...

متن کامل

An Efficient Multiway Mergesort for GPU Architectures

Sorting is a primitive operation that is a building block for countless algorithms. As such, it is important to design sorting algorithms that approach peak performance on a range of hardware architectures. Graphics Processing Units (GPUs) are particularly attractive architectures as they provides massive parallelism and computing power. However, the intricacies of their compute and memory hier...

متن کامل

Optimizing power efficiency for 3D stacked GPU-in-memory architecture

With the prevalence of data-centric computing, the key to achieving energy efficiency is to reduce the latency and energy cost of data movement. Near data processing (NDP) is a such technique which, instead of moving data around, moves computing closer to where data is stored. The emerging 3D stacked memory brings such opportunities for achieving both high power-efficiency as well as less data ...

متن کامل

Predictable transactional memory architecture for hierarchical mixed-criticality systems

A transactional memory simplifies the concurrency management in multicore systems by permitting sets of load and store instructions to be executed in an atomic way. The correct results for concurrent transactions and the execution time strongly depend on the coherency potentials, rollback capabilities and strategies of the transactional memory. A transactional memory can be implemented as a Har...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015